Data Visualization
Module 2: Workbook 5
Introduction
Welcome to Notebook 5 of Module 2! At this point in our notebook series, we have built out our descriptive analysis, and are now think about the findings and how to appropriately convey them. For outputs deemed best displayed in an image, we may have started on some initial plots in ggplot2
, largely relying on its base functionality. Here, we will show you different ways you can leverage the powerful ggplot2
package to create presentation- and publication-quality data visualizations from our descriptive analysis. We will also discuss different visualization options based on the type of the analysis.
We will cover the following visualizations in this notebook:
- Density Plot: is very useful for showing the distribution of a variable, and is a more continuous version of a histogram
- Line Plot: is typically used for time series data to show how a variable changes over time
- Bar Plot: visualizes relationships between numerical and categorical variables
- Heat Map: shows geographical variations in a variable using graded differences in color
Technical setup
As in previous notebooks, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Loading our analytic frame.
Load libraries
We will start by loading necessary packages not readily available in the base R setup.
As a reminder, every time you create a new R file, you should copy and run the following code snippet.
options(scipen = 999) # avoid scientific notation
library(RJDBC)
library(tidyverse)
Establish database connection
The following set of commands will set up a connection to the Redshift database:
=Sys.getenv("DBUSER")
dbusr=Sys.getenv("DBPASSWD")
dbpswd
<- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"
url
<- JDBC(
driver "com.amazon.redshift.jdbc42.Driver",
classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
identifier.quote="`"
)
<- dbConnect(driver, url, dbusr, dbpswd) con
For this code to work, you need to have an .Renviron
file in your user folder (i.e. U:\\John.Doe.P00002
) containing your username and password.
Loading our analytic frame
As we did in the previous notebook, can recreate our analytic frame by using SQL joins to filter the fact table to only include our cohort members.
<- "
qry select f.*
from tr_wi_2023.nb_cohort c
join tr_wi_2023.wi_mdim_person p on (c.ssn = p.ssn)
join tr_wi_2023.wi_fact_weekly_observation f on (p.person_id = f.person_id)
"
<- dbGetQuery(con, qry) analytic_frame
Visualization
This initial section is quite technically-focused. If you’d like, you can skip to the Density plot subsection.
Recall the structure of traditional ggplot2
syntax:
start with the
ggplot()
statementthen, supply a dataset and aesthetic mapping with
x
pertaining to the variable on the x-axis, and so on, for example:ggplot(dataset, aes(x = ..., y = ...)
from there, provide a geometry type for your plot, represented by
geom_*
to convey the desired type of visualization, for examplegeom_line
will plot a line,geom_point
will plot pointsfinally, add additional layers if necessary using
+
, which we will use to add annotations and other customization to the plot, including adding labels and titlesIf you like using the other tidyverse packages like
dplyr
, we can connect our data processing and summary workflow directly toggplot()
using the pipe operator%>%
Density plot
To illustrate the density plot, we will evoke one we created when initially exploring our analytic frame in the Data Model and Record Linkage notebook.
On this plot, the y-axis will show the relative frequencies of cohort members with the corresponding number of weeks of claimed benefits and received benefits in this specific benefit year. Compared to a histogram, the density plot is a smoother representation, referred to as a kernel density.
Recall the initial code we wrote, where one argument in geom_density()
was used to differentiate the overlapping plots:
<- analytic_frame %>%
plot_data filter(benefit_yr_start == "2022-03-20") %>%
group_by(person_id) %>%
summarize(
n_weeks_claimed = sum(benefit_claimed == "Y"),
n_weeks_received = sum(normal_benefit_received == "Y")
%>%
) ungroup()
# make longer for ease in legend creation in ggplot
%>%
plot_data pivot_longer(
cols = starts_with("n"),
names_to = "stat",
values_to = "weeks"
%>%
) ggplot(aes(x=weeks, fill=stat)) +
geom_density(alpha = .25) # alpha changes transparency
Depending on the features you hope your audience focuses, the choice of overlapping densities may not be the wisest decision. Although it may be helpful in explaining the relative differences in distributions for weeks claimed compared to weeks received, it may be difficult to glean much else, such as the distribution of weeks received for this particular cohort in a specific benefit year.
In that case, we may opt for a simpler initial plot:
%>%
plot_data ggplot(aes(x=n_weeks_received)) +
geom_density()
We can further modify our base plot to include more informational elements, like titles, labels, and other annotations. TO begin, we will specify the following:
Plot title: We want a simple statement that conveys the major takeaway(s) of the graph.
Axis labels: To further allow our audience to understand what is being plotted, we will provide well-formatted labels for our axes.
Data source annotation: Providing clear reference and source of the underlying data used for the visualization can increase the credibility and enable the reproducibility of your results. Additionally, if you want, you can also identify the analyst responsible for creating the majestic data visualization.
A handy way to easily modify a plot is to first create a ggplot
object from our base plot before adding layers to it.
Note: When initially adding new layers, we recommend that you do not overwrite the
ggplot
object until you are satisfied with the result of the layer.
<- plot_data %>%
d_plot ggplot(aes(x=n_weeks_received)) +
geom_density()
d_plot
We can add a title and axis labels with the labs()
function. title
adds a title to the plot and x
and y
provide labels for their respective axes.
We can also add a caption to the bottom, using caption
, to properly attribute the visual by its dataset and perhaps the visualization developer. Likewise, we can add a subtitle to add additional description to our plot using subtitle
.
<- d_plot +
d_plot labs(title = "Many Claimants Received Less Than REDACTED Weeks of UI Benefits",
subtitle = "Density Plot of Number of Weeks Benefits were Received in their 2022 Benefit Year",
x = "Number of weeks",
y = "Density",
# \n is new line
caption = "Source: Wisconsin PROMIS data \n Created by Irma Analyst, Ph.D.")
d_plot
This is much better than what we started with, but we can add additional refinements to the plot, such as adding a marker showing the median number of weeks, and modifying the overall theme of the plot.
<- plot_data %>%
median_d_plot summarize(median = quantile(n_weeks_received, .5)) %>%
pull()
<- d_plot +
d_plot geom_vline(xintercept = median_d_plot,
linetype = "dotted",
color = "red",
size = 1.5) +
theme_classic()
d_plot
This marker does not mean much without context - we can add further text annotations on the plot using the annotate()
function. In this function, we specify what we want to add to the plot, in terms of text and the x & y coordinates on the plot.
This can sometimes require a little trial and error to get the exact x and y coordinates so they appear like you want on the plot, as we often place the text to the side of the curve to avoid overlap.
<- d_plot +
d_plot annotate(geom = "text",
x= median_d_plot+2.5,
y = .08,
color = "red",
label = "Median")
d_plot
This plot has improved substantially with some minimal addition to our code, and we can continue to use these additional elements as needed in any other kind of ggplot
graph we make.
Exporting plots
An important, but not always obvious aspect of creating plots in R is getting them exported. If your plot is made with ggplot
, then you can use the ggsave()
function to save it to any number of different image formats for export.
You can specify the size of the file, the image type and the resolution of the image very easily. If you choose not to store your plots as objects in R, e.g.(d_plot
from our example), then ggsave()
will automatically save the last plot you generated, otherwise you can give it a plot object name to save a specific plot. Here, we save our plot as a .png
format (very common for web and document-bound images), with a print resolution, and size 5 inches high by 7 inches wide.
ggsave(d_plot,
filename = 'WI_dens_plot.png',
dpi = "print",
width = 7,
height = 5)
Line plot
Our next plot type we will work with is a line plot. A line plot looks similar to a density plot, which also used a line to show the values of our summary, but is a much more general way to show data, especially over time. In this section, we will build off a line chart we generated in the Quarterly Wages section of the Measurement notebook, which displays average wages by quarter relative to the start of their 2022 benefit year based on the nature of their UI claim history.
The code required to develop this plot is quite extensive, and may be more simply accessed through the measurement notebook - we will still copy all of this code in the cell below.
<- analytic_frame %>%
spell_volume_measure filter(benefit_yr_start == "2022-03-20") %>%
group_by(person_id) %>%
summarize(
n_weeks_claimed = sum(benefit_claimed == "Y"),
%>%
) ungroup() %>%
mutate(
spell_volume = case_when(
< quantile(n_weeks_claimed, probs = .25) ~ "low",
n_weeks_claimed >= quantile(n_weeks_claimed, probs = .25) ~ "high"
n_weeks_claimed
),spell_volume = factor(spell_volume, c("low", "high"), ordered = TRUE) # set as factor
%>%
) select(-n_weeks_claimed)
<- analytic_frame %>%
claim_frequency_measure # only focused on observations where benefits were claimed
filter(benefit_yr_start == "2022-03-20", benefit_claimed == "Y") %>%
group_by(person_id) %>%
summarize(
n_weeks_claimed = n(),
first_week_claimed = min(week_id),
last_week_claimed = max(week_id)
%>%
) mutate(
# add one because range is inclusive
duration = last_week_claimed - first_week_claimed + 1,
claim_frequency = if_else(
== n_weeks_claimed,
duration "continuous",
"stuttered"
)%>%
) ungroup() %>%
select(person_id, claim_frequency)
<- claim_frequency_measure %>%
measures inner_join(spell_volume_measure, by = "person_id")
<- analytic_frame %>%
quarters_in_range distinct(calendar_year, calendar_quarter) %>%
filter(
== 2021 & calendar_quarter %in% c(2,3,4) | calendar_year == 2022
calendar_year %>%
) arrange(calendar_year, calendar_quarter) %>%
mutate(
quarter_from_entry = row_number() - row_number()[calendar_year == 2022 & calendar_quarter == 1]
)
With the proper data frames now available in our environment, we can re-run the code snippet used to create the preliminary line chart, saving it to l_plot
.
<- analytic_frame %>%
l_plot inner_join(quarters_in_range, by = c("calendar_year", "calendar_quarter")) %>%
filter(employed_in_quarter == "Y") %>%
distinct(person_id, quarter_from_entry, total_wages) %>%
# add in person-level measures data frame
inner_join(measures, by = "person_id") %>%
group_by(quarter_from_entry, spell_volume, claim_frequency) %>%
summarize(
avg_wages = mean(total_wages)
%>%
) ungroup() %>%
ggplot(aes(x=quarter_from_entry,
y = avg_wages,
linetype = spell_volume,
color = claim_frequency)) +
geom_line()
l_plot
Let’s start by applying some of the same techniques from before to l_plot
.
# update titles, change theme
# can update legend titles by assigning titles to ggplot aesthetics
<- l_plot +
l_plot labs(
title = "Claimants with REDACTED Spell Volumes Earn REDACTED in the Quarters Pre- and \nPost- Benefit Entry, On Average",
x = "Quarter Relative to UI Benefit Start Year (March 2022)",
y = "Average Quarterly Wages",
subtitle = "Average Quarterly Wages by Benefit Characteristics Relative to 2022 UI Benefit Start Year",
caption = "Source: WI PROMIS and UI Wage data \n Created by Irma Analyst, Ph.D.",
color = "Claim Frequency",
linetype = "Claim Volume"
+
) theme_classic()
l_plot
Note that in the previous plot, because it did not require a legend, the caption was already right-aligned. We can enforce the same standard with the existence of a legend by updating the caption’s position.
# default aligns to plot panels, "plot" aligns to entire plot
<- l_plot +
l_plot theme(
plot.caption.position = "plot"
)
l_plot
Because the plot features different colors and line types, we can adjust the default values to better differentiate between the four lines, making them more accessible.
scale_color_brewer()
provides accessible color schemes from ColorBrewer, with options for different variable relationships (sequential, diverging, qualitative). Here, our subgroups are qualitative, so we will opt for one of the qualitative palette options.
We can also update the thickness of each line by adjusting the size
parameter of geom_line()
, with its default at 1. In newer versions of ggplot2
, the size
parameter has been separated into a size
aesthetic for handling sizing, with linewidth
controlling the width.
Note: We recommend you enforce consistent aesthetic choices for the same subgroups across plots (ex. keep the colors for claim frequency).
<- l_plot +
l_plot scale_color_brewer(palette = "Dark2") +
geom_line(size = 1.3)
l_plot
We can further improve the clarity of the visualization by adjusting the axes. Specifically, we can update the tick marks on the x-axis to reflect key points, which are all seven quarters (-3 to 3), as opposed to just -2, 0, and 2. Additionally, we can expand the range of the y-axis to start at 0.
<- l_plot +
l_plot # start y-axis at 0
expand_limits(y=0) +
# change x-axis tick mark frequency
scale_x_continuous(
breaks = seq(from = -3, to = 3, by= 1)
)
l_plot
Finally, if we wanted to highlight specific values on the line - say, at the end, we can do so using the ggrepel
package, which ensures text labels from overlapping. In this case, because we want to update text, instead of using geom_text()
, we will use geom_text_repel()
.
Since we just want to highlight the final values on the line, rather than all values, we can filter our initial data frame to values at the end of the plot (quarter_from_entry == 3
), and use it as an input to geom_text_repel()
.
library(ggrepel)
<- analytic_frame %>%
data_ends inner_join(quarters_in_range, by = c("calendar_year", "calendar_quarter")) %>%
filter(employed_in_quarter == "Y") %>%
distinct(person_id, quarter_from_entry, total_wages, primary_employer_id) %>%
# add in person-level measures data frame
inner_join(measures, by = "person_id") %>%
group_by(quarter_from_entry, spell_volume, claim_frequency) %>%
summarize(
avg_wages = mean(total_wages),
n_people = n_distinct(person_id),
n_employers = n_distinct(primary_employer_id)
%>%
) mutate(
avg_wages = round(avg_wages)
%>%
) filter(quarter_from_entry == 3)
+
l_plot geom_text_repel(
data = data_ends,
aes(label = avg_wages),
# adjust x-axis position of text
nudge_x = .3,
# only move text in y direction to ensure horizontal alignment
direction = "y"
+
) # update scale to allow for more room on right side to fit labels
scale_x_continuous(
breaks = seq(from = -3, to = 3, by= 1),
limits = c(-3, 3.5)
)
We can then save this file in our working directory.
ggsave(l_plot,
filename = 'WI_line_plot.png',
dpi = "print",
width = 7, height = 5)
Bar Plot
Recall that a bar plot can be great for plotting relationships between numerical and categorical variables, often in situations where we want to compare relative sizes. A traditional bar plot may look like the following:
# mtcars is a built-in public dataset in R
ggplot(mpg, aes(y = class)) +
geom_bar()
Thus far, we have not generated any traditional bar graphs, instead opting for tabular displays. If you recall, though, we have used geom_bar()
in a handful of prior code snippets, with the x-axis representing fixed, continuous points (like relative week). Specifically, recall our bar plot of exit rates.
The following code regenerates this plot:
<- analytic_frame %>%
exit_rate_measure # just looking at benefit reception observations
filter(benefit_yr_start == "2022-03-20", normal_benefit_received == "Y") %>%
group_by(person_id) %>%
summarize(
last_week = max(week_ending_date),
last_week_id = max(week_id),
n_employers = n_distinct(primary_employer_id),
n_people = n_distinct(person_id)
)
<- analytic_frame %>%
benefit_start_id filter(week_ending_date == "2022-03-26") %>%
distinct(week_id) %>%
pull()
<- exit_rate_measure %>%
exit_rate_plot_data group_by(last_week, last_week_id) %>%
summarize(
n_leaving = n(),
n_employers = sum(n_employers),
n_people = sum(n_people)
%>%
) ungroup() %>%
arrange(last_week_id) %>%
#cumsum finds cumulative sum
mutate(
n_remaining = sum(n_leaving) - cumsum(n_leaving),
relative_week = last_week_id - benefit_start_id
)
ggplot(exit_rate_plot_data, aes(x = relative_week, y = n_remaining)) +
geom_bar(stat = "identity")
We can consider this a time-series visualization, and perhaps a line plot may be more suitable. That being said, for pedagogical purposes, we will continue with the visualization as a bar graph. Because we are showing counts on the y-axis, and not percentage, it may be helpful to add a horizontal line representing the 50% cutoff point. To find this, we can divide the total count (n_leaving
+ n_remaining
in the first week) by 2. We’ll also update the theme and labels in this snippet:
<- ggplot(exit_rate_plot_data, aes(x = relative_week, y = n_remaining)) +
b_plot geom_bar(stat = "identity")
# find total cohort size
<- exit_rate_plot_data %>%
cohort_size filter(relative_week == 1) %>%
summarize(
+ n_remaining
n_leaving %>%
) pull()
# graph and label horizontal line
<- b_plot +
b_plot geom_hline(
yintercept = cohort_size/2,
linetype = "dotted",
color = "red",
size = 1.5
+
) scale_x_continuous(
breaks = seq(0, 50, 5)
+
) annotate(
geom = "text",
x = 40,
y = (cohort_size/2) + 50,
color = "red",
label = "50% cutoff"
+
) # update titles
labs(
title = "The Exit Rate Slows by Week REDACTED",
x = "Week Since Benefit Year Start",
y = "Number Remaining on UI Benefits",
subtitle = "Exit Counts by Week Relative to Benefit Year Start in 2022",
caption = "Source: WI PROMIS data \n Created by Irma Analyst, Ph.D."
+
) # update theme
theme_classic()
b_plot
Lastly, for reference, we can add annotations of the number of individuals tracked at the beginning and end of the benefit year.
# find first and last week in the data
<- exit_rate_plot_data %>%
data_start filter(relative_week == 1) %>%
pull(n_remaining)
<- exit_rate_plot_data %>%
data_end filter(relative_week == 50) %>%
pull(n_remaining)
# choose annotation two weeks to the right of the bar
<- b_plot +
b_plot annotate(geom = "text",
x= 3,
y = data_start,
color = "black",
label = data_start) +
annotate(geom = "text",
x= 52,
y = data_end,
color = "black",
label = data_end)
b_plot
At the end of our transformations, we can save the resulting image.
ggsave(b_plot, filename = 'WI_bar_plot.png', dpi = "print", width = 7, height = 5)
Heat Map
As our final visualization, we will showcase a heatmap based on geography. We have not focused much on the potential geographical analyses within this data, so we will create an example inspired by an analysis in our cross-section notebook, where we found the most common industries by workforce development area for our initial cross-section.
We will modify this analysis to examine our initial cohort cross-section (easier to pull), graphing the claimant counts by the relative labor force by county in 2022 using public BLS data.
We want to create a county map for the state, so we must use the ZIP code to county crosswalk file to identify which zip codes are within which counties. This query accomplishes this:
<- "
qry select c.*, xwalk.county
from tr_wi_2023.nb_cohort c
left join tr_wi_2023.wi_rdim_zip_county_wda_xwalk xwalk on (c.res_zip = xwalk.zip)
"
<- dbGetQuery(con, qry) cohort_cross_section
We can’t simply map the numbers of claimants in each county, because larger population counties will naturally have higher numbers of claimants, so we must create a rate of sorts by normalizing to the labor force size in each county.
We have these data from the BLS, and we much merge them to our claimant data to create the rates.
First we aggregate the number of claimants by each county:
<- cohort_cross_section %>%
claims_by_county # convert to title name
mutate(county = str_to_title(county)) %>%
group_by(county) %>%
summarize(
n_claimants = n_distinct(ssn),
n_employers = n_distinct(ui_number)
%>%
) ungroup()
Next, we can merge this data frame to the BLS data, and calculate our rate, here per 10,000 people in the labor force. Ideally, we would use a county FIPS code, but these data only have county names, which requires us to manipulate some names to be the same in both data frames, specifically for Saint Croix and Fond Du Lac.
<- read_csv("P:/tr-wi-2023/Public Data/Labor Force - LAUS.csv")
labor_force
<- labor_force %>%
h_plot_data mutate(
cnty_name = word(Area, 1, sep = " County"),
cnty_name = case_when(
== "St. Croix" ~ "Saint Croix",
cnty_name == "Fond du Lac" ~ "Fond Du Lac",
cnty_name TRUE ~ cnty_name
)%>%
) # only use 2022 data since cross section is in 2022
filter(Year == 2022) %>%
# don't need rest of the variables
select(cnty_name, `Labor Force`) %>%
left_join(claims_by_county, by = c("cnty_name" = "county")) %>%
# multiply by 10000 to find rate per 10000 individuals
mutate(
claimant_rate = 10000 * coalesce(n_claimants / `Labor Force`,0)
)
head(h_plot_data)
Now that we have the rates per county, we just need to figure out how to graph them on a map of the state. To do this, we will use the sf
package, which reads spatial, or map, data into R and creates map visualizations. Public Census geographic data are available in the project folder at P:/tr-wi-2023/Public Data/Support team upload/
.
`{r} library(sf) #read GEOJSON into R as df # quiet suppresses info on name, driver, size and spatial reference counties <- st_read( "P:/tr-wi-2023/Public Data/Support team upload/county_geographies.geojson", quiet = TRUE ) %>% filter(STATEFP == 55) #filter for Wisconsin
Creating our map is easy, using the geom_sf()
geometry in ggplot()
.
# left join so we have county geography info for each county even if they did not
# have any claimants in the cross-section
<- counties %>%
h_plot left_join(h_plot_data, by = c("NAME" = "cnty_name")) %>%
ggplot() +
geom_sf(aes(fill=claimant_rate))
h_plot
We can also apply a different color scheme to the plot using various options in the available suites of color palettes to improve accessibility.
<- h_plot +
h_plot scale_fill_viridis_c()
h_plot
If we want to label the counties with the highest rates, we can use the geom_label_repel()
function, which has similar functionality relative to geom_text_repel()
.
# find counties with highest claimant rates
# top_n sorts and finds highest values
# can inner join because only including 5 counties
<- h_plot_data %>%
high_counties top_n(5, claimant_rate) %>%
inner_join(counties, by = c("cnty_name" = "NAME"))
<- h_plot +
h_plot geom_label_repel(data = high_counties,
aes(label = cnty_name, geometry = geometry),
stat = "sf_coordinates",
min.segment.length = 0)
h_plot
Like the other plots in this notebook, we can add titles and annotations to the plot using labs()
.
<- h_plot +
h_plot labs(
title = "Wisconsin Counties with the 5 highest UI Claim Rates",
subtitle = "Per 10,000 Labor force participants",
fill = "Claimants",
caption = "Source: Wisconsin PROMIS data and BLS\n Created by Irma Analyst, Ph.D."
)
h_plot
Once we are satisfied with our output, we can save the visualization.
ggsave(h_plot,
filename = 'WI_Heatmap.png',
dpi = "print",
width = 7, height = 7)
Checkpoint
Of your findings, which ones are most suitable to visualization? Why? Are there additional updates you would like to make to any of these plots?
Next steps: Applying this notebook to your project
Although this notebook is quite technical and focused on final outputs, it can still be useful as you are producing your descriptive analysis. In particular, this notebook provides a variety of display options, and you should think about the best choice and design for exhibiting your findings. You can start by creating the base plot and think about an ideal title, so you can adjust the aspects of the graph to highlight your findings for the audience. At a minimum, it will be helpful for the business-oriented members of your team if you reuse the ggsave()
code and save preliminary plots early and often, so they can provide their input on the direction of the analysis.
Additionally, we recommend revisiting this notebook as you begin preparing to export your final tables and graphs from the ADRF, so you can apply layering updates to ensure your exports are ready for your final presentation and report. There are many other ggplot2
layer aspects we did not cover in this notebook; thankfully, there are many open-source posts and examples for you to draw from as well.
Citations
Kamil Slowikowski (2021). ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’. R package version 0.9.1. https://CRAN.R-project.org/package=ggrepel
Pedersen, T. L. (2022, August 24). Make your ggplot2 extension package understand the new linewidth aesthetic [web log]. Retrieved July 28, 2023, from https://www.tidyverse.org/blog/2022/08/ggplot2-3-4-0-size-to-linewidth/.
Tian Lou, & Dave McQuown. (2021, March 8). Data Visualization using Illinois Unemployment Insurance Data. Zenodo. https://doi.org/10.5281/zenodo.4589040